
grpc: set MaxConcurrentStreams to avoid sudden traffic spikes that lead to PD OOM #8977

Open

okJiang wants to merge 1 commit into tikv:master from okJiang:set-maxconcurrency

Conversation

@okJiang
Member

@okJiang okJiang commented Jan 6, 2025

What problem does this PR solve?

Issue Number: Close #8882, ref #4480

What is changed and how does it work?

Added MaxConcurrentStreams to limit request concurrency. This is a self-protection mechanism for PD: once the number of concurrent requests reaches the limit, gRPC makes new requests wait until ongoing ones finish and resources are freed for them. Requests may therefore become slower in this situation, but PD gains better robustness.

PS: This parameter cannot be modified at runtime, so we cannot support changing it via pd-ctl. If you think it should be configurable, please comment.
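
For reference, here is a minimal sketch of how this kind of limit is applied to a grpc-go server. grpc.MaxConcurrentStreams is the standard grpc-go server option; the constant value and function below are placeholders for illustration, not the exact change in this PR:

```go
package server

import "google.golang.org/grpc"

// placeholderMaxConcurrentStreams is illustrative only; the real default is a
// config item added by this PR, not this number.
const placeholderMaxConcurrentStreams uint32 = 1024

// newGRPCServer caps the number of HTTP/2 streams that a single client
// connection may keep open at the same time. Extra RPCs from that connection
// queue until a stream slot frees up, instead of piling up in server memory.
func newGRPCServer(opts ...grpc.ServerOption) *grpc.Server {
	opts = append(opts, grpc.MaxConcurrentStreams(placeholderMaxConcurrentStreams))
	return grpc.NewServer(opts...)
}
```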

Check List

Tests

  • Unit test
  • Integration test

Release note

None.

Signed-off-by: okJiang <819421878@qq.com>
@ti-chi-bot ti-chi-bot Bot added labels on Jan 6, 2025: release-note-none (Denotes a PR that doesn't merit a release note.), dco-signoff: yes (Indicates the PR's author has signed the dco.), size/S (Denotes a PR that changes 10-29 lines, ignoring generated files.)
@codecov

codecov Bot commented Jan 6, 2025

Codecov Report

❌ Patch coverage is 57.14286% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.61%. Comparing base (41919ad) to head (f9a16f4).
⚠️ Report is 545 commits behind head on master.

❌ Your patch check has failed because the patch coverage (57.14%) is below the target coverage (74.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8977      +/-   ##
==========================================
+ Coverage   76.31%   77.61%   +1.30%     
==========================================
  Files         465      532      +67     
  Lines       70547    94134   +23587     
==========================================
+ Hits        53839    73065   +19226     
- Misses      13361    17187    +3826     
- Partials     3347     3882     +535     
Flag        Coverage Δ
unittests   77.61% <57.14%> (∅)

Flags with carried forward coverage won't be shown.


Comment thread on server/config/config.go
defaultGCTunerThreshold = 0.6
minGCTunerThreshold = 0
maxGCTunerThreshold = 0.9
// If concurrentStreams reaches 600k, the memory usage is about 40GB. To

Member

Are there any tests to support this conclusion?

Member Author

@okJiang okJiang Jan 6, 2025

This conclusion comes from a real-world case where the cluster contained millions of regions and its requests were mainly ScanRegions.

Member

CMIIW, it only limits the concurrency for one connection, not total concurrency on the server side.

Member Author

In simple terms, should we treat this sudden surge in traffic as an anomaly? If so, I think we can set this parameter to protect PD. Do you know under what normal circumstances PD would experience such high traffic? For example, a 16 GB PD handling 160k concurrent requests.

Member Author

https://github.com/etcd-io/etcd/blob/fce823ac2830033270f8fe03fa1b56e62bf882b8/server/embed/config.go#L230-L232

// MaxConcurrentStreams specifies the maximum number of concurrent
// streams that each client can open at a time.

If users make massive requests from multiple clients, we have no way to limit the total with this option: it applies per client connection, so k connections can still open up to k × MaxConcurrentStreams streams in total.

Contributor

Is it applicable to both unary and streaming RPCs?

Member Author

Yes. This is a gRPC/HTTP2 transport-level limit, so it applies to both unary RPCs and streaming RPCs.

  • Unary: each in-flight unary call occupies one stream until the call finishes.
  • Stream: each open client/server/bidi stream occupies one stream for its whole lifetime.
  • Scope: it limits concurrent streams per client connection (server transport), not the total concurrency across all clients.

So it can help on both paths, but for streaming it only limits the number of open streams, not the number of messages within one stream. I think this thread is about clarifying the behavior, so no code change is needed here.

Member Author

Rechecked: yes, it applies to both unary RPCs and streaming RPCs, because this limit is enforced at the gRPC/HTTP2 stream layer per client connection.

  • Unary RPC: one in-flight call uses one stream until it completes.
  • Streaming RPC: one open client/server/bidi stream holds one stream for its lifetime.
  • It does not limit the total concurrency across multiple client connections.

So the knob is effective for both unary and stream requests, but for streaming it limits the number of open streams rather than the number of messages sent on one stream. No code change is needed for this clarification.
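
To make the per-connection scope concrete, here is a small standalone sketch (an illustration only, not PD code): it uses grpc-go's built-in health service with a deliberately tiny limit of 1, and shows that a streaming call and a unary call on the same connection share the stream budget, while a second connection gets its own budget.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	lis, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	// Allow only one concurrent stream per client connection (demo value).
	srv := grpc.NewServer(grpc.MaxConcurrentStreams(1))
	healthpb.RegisterHealthServer(srv, health.NewServer())
	go srv.Serve(lis)
	defer srv.Stop()

	dial := func() healthpb.HealthClient {
		conn, err := grpc.Dial(lis.Addr().String(),
			grpc.WithTransportCredentials(insecure.NewCredentials()))
		if err != nil {
			panic(err)
		}
		return healthpb.NewHealthClient(conn)
	}
	c1, c2 := dial(), dial()

	// The streaming Watch RPC stays open and occupies the only stream slot
	// of connection 1.
	if _, err := c1.Watch(context.Background(), &healthpb.HealthCheckRequest{}); err != nil {
		panic(err)
	}

	// A unary Check on the same connection must wait for a free slot, so it
	// hits the deadline here.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	_, err = c1.Check(ctx, &healthpb.HealthCheckRequest{})
	fmt.Println("same connection:", err)

	// A second connection has its own stream budget, so its Check succeeds:
	// the limit is per connection, not a server-wide total.
	_, err = c2.Check(context.Background(), &healthpb.HealthCheckRequest{})
	fmt.Println("other connection:", err)
}
```

With these assumptions, the first Check should fail with DeadlineExceeded while the second succeeds, which matches the per-connection semantics described above.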

@okJiang
Member Author

okJiang commented Feb 8, 2025

No more updates or comments, so I'm closing it now.

@okJiang okJiang closed this Feb 8, 2025
@okJiang okJiang reopened this Aug 7, 2025
@ti-chi-bot
Contributor

ti-chi-bot Bot commented Aug 7, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign benmeadowcroft for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@okJiang
Member Author

okJiang commented Aug 7, 2025

/cc @lhy1024

@ti-chi-bot ti-chi-bot Bot requested a review from lhy1024 August 7, 2025 02:20

Labels

  • dco-signoff: yes (Indicates the PR's author has signed the dco.)
  • release-note-none (Denotes a PR that doesn't merit a release note.)
  • size/S (Denotes a PR that changes 10-29 lines, ignoring generated files.)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

A large volume of requests combined with slow response time will significantly increase the memory usage of the PD

3 participants